\chapter{General \commonlisp{} style guide}
\label{chap-general-commonlisp-style-guide}

\section{Purpose of style restrictions}

The purpose of imposing a particular style is based on a few simple
facts that hold true for both natural languages and programming
languages:

\begin{itemize}
\item The set of all idiomatic phrases is a tiny subset of the set of
  all grammatical phrases.
\item The main purpose of these phrases is to serve as communication
  between people.
\end{itemize}

To illustrate the first fact, consider a natural language such as
English.  In English, we say ``tooth brush'', but ``dental floss''.
The words ``dental brush'' and ``tooth floss'' would be perfectly
grammatical, but they are not used.  A person trying to communicate
with other people must use the words that have been widely agreed
upon, even though some other words are perfectly legitimate.  It might
seem that such idiosyncrasies would be limited to languages with
multiple heritage such as English, but that is not the case.  In
French, we say ``brosse à dents'', ``pâte dentifrice'', and ``fil
dentaire''.  There is a large number of reasonable combinations, but
only one is used.

The same thing is true for programming languages.  The community has
collectively decided on a particular subset of all the grammatical
phrases, and a programmer who wishes to communicate with other
programmers should stick to that subset.

It should also be emphasized that the choice of idioms is different in
different languages.  An example from natural languages would be that
in English we say ``I wash my hands'', in French ``I wash myself the
hands'', and in Swedish we say ``I wash the hands [on myself]''.  Just
as it would be pointless trying to use an idiom from one language in a
translated version in a different language, it is as pointless to
translate idioms from one programming language to a different
programming language.

Finally, the choice of what phrases are idioms and what phrases are
not, is almost totally arbitrary, and based on coincidences of
history.  Therefore it is rarely productive to ask oneself \emph{why}
a particular phrase is an idiom and a different one is not.  There is
no possible enlightening answer to such a question.


\section{Width of a line of code}

The purpose of this section is to determine the preferred maximum
number of characters in a line of program text in a way that:
\begin{itemize}
\item the width of a text window can be chosen so that few or no lines
  are wrapped, and
\item as much program text as possible can be presented to the
  programmer on the kind of screen used for programming.
\end{itemize}

There is no general way of determining an optimal width of a line of
program text, because the value depends on factors that are individual
to the programmer as well as to the installation used for programming.
Some such factors are:

\begin{itemize}
\item Font size.  Some programmers can comfortably view very small
  fonts; others require bigger ones.
\item Size of monitor, both in pixels and in physical dimensions.
\item Viewing distance.
\item The condition of the eyes of the programmer.
\end{itemize}

However, we need to establish some kind of convention without taking
such factors into account.

Clearly, if we make the maximum width of a line too small, then
compound forms most more often be folded over several lines, thereby
increasing the number of lines of code and decreasing the total amount
of program text that can be viewed with a particular window height.

Conversely, if we make the maximum width too large, and we adapt the
width of the window accordingly, then most lines that are
significantly shorter will contain large amounts of white space, and
the width of the window may make it impossible to have another window
next to the current one on the same screen.

We think that a good compromise is 80 characters.  This value used to
be a hard limit, because some printers or printer drivers would
truncate longer lines.  Since it is less common to print code these
days, the limit is now soft.

\section{Documentation strings}
\label{sec-general-coding-style-documentation-strings}

Documentation strings are meant for the people who write \emph{client
  code}, i.e., code that makes use of the feature (function, class,
etc.) being documented by a documentation string.  Do not confuse the
target audience for documentation strings with that for
\emph{comments} \seesec{sec-general-coding-style-commenting}.

This intended purpose of documentation strings has a few consequences
for the coding style.  First, documentation strings are largely
\emph{noise} to a person reading the code being documented.  The
reason is simple; presumably the person reading the code already knows
how it is intended to be used, so the documentation string contains
information already known to that person.  An extensive documentation
string for some code may very well contain much more text than the
code itself, thereby making the code harder to read if the two are
physically close.  Furthermore, a multi-line documentation string can
often look unappealing and distracting if most lines start in the
first text column.  An unfortunate and frequent consequence of these
disadvantages is that programmers have a tendency to write skimpy
documentation strings.

One possible remedy for the noise problem is to separate the
documentation strings from the code, by taking advantage of the
function \texttt{(setf documentation)}.  The documentation string can
then be placed \emph{after} the code it documents, or even in a
separate file from the one containing the code.  One possible problem
with this separation is that it is then riskier that the documentation
becomes incorrect as the code evolves.  However, this risk is not as
great as the analogous one for comments, since it is (or should be)
much less common that client-visible code has its interface altered.

A remedy for the problem of unappealing documentation strings, whether
together with the code or in a separate file, is to use
\texttt{format} at read time, and to use the tilde-newline reader
macro with the \texttt{@} modifier, so that lines starting with the
second one can be indented so that they align with the first one.

Another important consequence of the purpose of documentation strings
is that there is absolutely no reason to provide documentation strings
for features that are not part of the protocol used by client code.
It is fairly common to see documentation strings provided for features
that are used only internally in a module.  This practice should be
avoided, especially since the documentation string is largely noise to
the code maintainer, as described above.

\section{Commenting}
\label{sec-general-coding-style-commenting}

Comments are meant for people who are also reading the code, typically
maintainers of the software.  Do not confuse the target audience for
comments with that for \emph{documentation strings}
\seesec{sec-general-coding-style-documentation-strings}.

Comments have the annoying disadvantage that they can not be verified
by the compiler.  There is therefore a significant risk that a comment
will become inaccurate as the code that is commented evolves.  It
takes great discipline on the part of the maintainer to make sure that
comments are updated to match the updated code.

Because of this disadvantage, comments should be used sparingly.  In
particular, comments should not be used as a substitute for clear
code, and comments should not paraphrase code when the meaning of that
code is clear without the comment.

It is common to see a style where each top-level function is preceded
by a comment containing information about the purpose, the arguments,
the return values, and perhaps even the author and the creation date
of the function.  This style should always be avoided.  It duplicates
information that is otherwise readily available from the code of the
function itself, or from the versioning system being used.  And such
comments are highly likely to become obsolete when the signature of
the function evolves.

To avoid some comments in favor of clear code, a good technique is
often to split up the code into smaller functions.  Then, rather than
having a commented block of code inside a big function, each such
block would be extracted into a separate function, with a name that
suggests its purpose, thereby often making the comment superfluous.

The main remaining reason for comments is then to explain some
general technique being used by the code, but that can not easily be
derived from the code itself.  If the code implements some known
algorithm, the comment may give a name and a reference to that
algorithm.  If the technique is specific to the particular code base
in question, a more elaborate description of the technique being used
may be required.

Use a single semicolon to introduce a comment that follows the code on
a line.  Use two semicolons for comments that are not at the top level
in a file and that should be aligned with the code that it comments
on.  Use three semicolons for top-level comments that concern some
top-level forms in a file, but not the entire file.  Use four
semicolons for comments that concern the entire file.

\section{Blank lines}

A single blank line is common in the following situations:

\begin{itemize}
\item Between two top-level forms.
\item Between a file-specific comment and the following top-level
  form.
\item Between a comment for several top-level forms and the first
  of those top-level forms.
\end{itemize}

A single blank line \emph{may} occur inside a top-level form to
indicate the separation of two blocks of code concerned with different
subjects, but it would be more common to put those two blocks of code
in separate functions.

There should never be any instance of two consecutive blank lines, and
the last line of the file should not be blank.

\section{\texttt{car}, \texttt{cdr}, \texttt{first}, etc are for
  \texttt{cons} cells}

The \commonlisp{} standard specifies that the function \texttt{car},
\texttt{cdr}, \texttt{first}, \texttt{second}, \texttt{rest}, etc
return \texttt{nil} when \texttt{nil} is passed as an argument.  This
fact should mostly be considered as a historical artifact and should
not be systematically exploited.  Take for instance the following
code:

\begin{verbatim}
(if (first x) ...)
\end{verbatim}

To the compiler, it means ``execute the false branch of the \texttt{if}
when either \texttt{x} is \texttt{nil}, or when \texttt{x} is a
\texttt{cons} cell in which the \texttt{car} slot is \texttt{nil}''.

To the person reading the code, it means something different
altogether, namely ``\texttt{x} holds a non-empty list of Boolean
values, and the false branch of the \texttt{if} should be executed
when the first element of that list is
\emph{false}. See also \refSec{sec-coding-style-meanings-of-nil}.

\section{Different meanings of \texttt{nil}}
\label{sec-coding-style-meanings-of-nil}

Consider the following local variable bindings:

\begin{verbatim}
(let ((x '())
      (y nil)
      z)
  ...)
\end{verbatim}

To the compiler, the three are equivalent.  To a person reading the
code, they mean different things, however:

\begin{itemize}
\item The initialization of \texttt{x} means that \texttt{x} holds a
  \emph{list} that is initially empty.
\item The initialization of \texttt{y} means that \texttt{y} holds a
  Boolean value or a default value that may or may not change in the
  body of the \texttt{let} form.
\item The absence of initialization of \texttt{z} means that no
  initial value is given to \texttt{z}.  In the body of the
  \texttt{let} form, the variable \texttt{z} will be assigned to
  before it is used.
\end{itemize}

The following body of the \texttt{let} form corresponds to the
expectations of the person reading the code:

\begin{verbatim}
(let ((x '())
      (y nil)
      z)
  ...
  (push (f y) x)
  ...
  (unless y (setf y (g x)))
  ...
  (setf z (h x))
  ...)
\end{verbatim}

The following body of the \texttt{let} form violates the expectations
of the person reading the code:

\begin{verbatim}
(let ((x '())
      (y nil)
      z)
  ...
  (push (f y) z)     ; z is used before it is assigned.
  ...
  (unless x          ; x is treated as a Boolean.
    (setf y (g x)))
  ...
  (push (f x) y)     ; y is treated as a list.
  ...)
\end{verbatim}

\section{Tests in conditional expressions}

The \emph{test} of a conditional expression should be a (possibly
generalized) Boolean expression.  The following expressions correspond
to the expectations of the person reading the code:

\begin{verbatim}
(if visited-p ...)
(when (member ...) ...)
(cond ((plusp x) ...) ...)
\end{verbatim}

The following code violates the expectation:

\begin{verbatim}
(let ((item (find ...)))
  (when item ...))
\end{verbatim}

because \texttt{item} is not a (generalized) Boolean value.  It is an
item returned by \texttt{find}, though there is an \emph{out of band}
value (\texttt{nil}) indicating that no item was found by
\texttt{find}.  In this case, the code that corresponds
to the expectations would look like this:

\begin{verbatim}
(let ((item (find ...)))
  (unless (null item) ...))
\end{verbatim}

But even this example is often not very good, because it uses
implementation details such as the container being a sequence and the
default value being nil.  A much better solution is often to create an
abstraction of those two things, for instance:

\begin{verbatim}
(let ((item (find-item ...)))
  (when (itemp item) ...))
\end{verbatim}

Here, we have introduce an abstraction \texttt{find-item} and all the
code is saying is that this function may return an object of type
\texttt{item}, or some other object indicating that no item was
found.  With these abstractions the implementation of the container
may change (to a hash table for instance) and the default value may
change as well.  An initial implementation of the abstractions would
then be:

\begin{verbatim}
(defun find-item (...)
  (find ...))

(defun itemp (item)
  (not (null item)))
\end{verbatim}

\section{General structure of recursive functions}

When possible, a recursive function should be structured like a
mathematical proof by induction.  By that we mean that the special
case should be handled \emph{first} so as to reassure the person
reading the code that this case can be handled correctly by the
function.


So for instance, assume we want to write a function that counts
the number of atoms in a tree, we should not write it like this:

\begin{verbatim}
(defun count-atoms (tree)
  (if (consp tree)
      (+ (count-atoms (car tree))
         (count-atoms (cdr tree)))
      1))
\end{verbatim}

but rather

\begin{verbatim}
(defun count-atoms (tree)
  (if (atom tree)
      1
      (+ (count-atoms (car tree))
         (count-atoms (cdr tree)))))
\end{verbatim}

Even when the base case does not return anything useful, it should be
handled first.  The following code violates the expectations:

\begin{verbatim}
(defun map-conses (function tree)
  (unless (atom node)
    (funcall function node)
    (map-conses function (car node))
    (map-conses function (cdr node))))
\end{verbatim}

and should be written like this instead:

\begin{verbatim}
(defun map-conses (function tree)
  (if (atom node)
      nil ; nothing to do
      (progn (funcall function node)
             (map-conses (car node))
             (map-conses (cdr node)))))
\end{verbatim}

though, admittedly, this example is a little too simple to illustrate
the importance of this rule.

\section{Using \texttt{car} and \texttt{cdr} vs. using \texttt{first}
  and \texttt{rest}}

While the two functions \texttt{car} and \texttt{first} have the exact
same definitions, as do \texttt{cdr} and \texttt{rest}, they send very
different messages to the person reading the code.

The functions \texttt{car}, \texttt{cdr}, etc., should be avoided when
the argument is to be considered as a \emph{list}, and should be
reserved for other uses of \texttt{cons} cells such as for
\emph{trees} or \emph{pairs} of values.

It follows that the two families of functions should never be mixed
for the same argument.

\section{Use the most specific construct}

There is a very general rule in programming that says that, when there
is a choice between two constructs that will accomplish some task,
then the most specific of those constructs should be used.

For example, the two forms \texttt{(+} \textit{expression} \texttt{1)}
and \texttt{(1+} \textit{expression}\texttt{)} both accomplish the
same task, but the second one is more specific, so it is preferred over
the first one.

There is a good reason for this rule.  It helps the person reading the
code determine the meaning of it faster.  In the particular case of
the previous example, there is an additional advantage, namely that
the meaning is determined \emph{before} the appearance of
\textit{expression}, making the meaning clear even sooner.

In \commonlisp{} there are a few exceptions to this rule. In
particular, even though \texttt{setq} is more specific than
\texttt{setf}, the trend is to use \texttt{setf} exclusively.  One
possible explanation for this exception is that \texttt{setq} is
considered obsolete for the purpose of application programming.
Similarly, one may consider \texttt{symbol-function} as obsolete, even
though it is more specific than \texttt{fdefinition}, in that it
requires its argument to be a symbol.

\section{Using packages}
\label{sec-coding-style-using-packages}

It is good practice to divide a code base into several different
\emph{modules}.  \commonlisp{} does not have a module concept, but
most of the advantages of a module can be obtained by the use of a
combination of an \texttt{asdf} system definition combined with a
\commonlisp{} \emph{package}.  Names representing the protocol
implemented by the module would then be exported by the package so
that these names can be used by client code without resorting to
breaking abstraction barriers.

We recommend that the \texttt{:use} option of the \texttt{defpackage}
form should contain only the \texttt{common-lisp} package, or
occasionally no package at all.  We recommend that, when code in some
package refers to a name in a different package (other than the
\texttt{common-lisp} package), an \emph{explicit package prefix} is
provided, as alternative to importing symbols.  There are several
reasons for this recommendation:

\begin{itemize}
\item The person reading the client code can then immediately see to
  which package a name belongs.
\item The author of packages then have greater freedom to choose names
  of exported symbols.  Some names are so common that there is a great
  risk that they will collide with names from other package, even the
  \commonlisp{} package.  Trying to avoid collisions by choosing
  different names often results in artificial-looking names.
\item When client code employs the \texttt{:use} option for an
  external package, client code assumes that no external symbols of
  the external package collide with those of other external packages.
  While that may be true when the client code is first written,
  updates to some external package may then break the client code by
  introducing more exported symbols that then collide.  By avoiding
  the \texttt{:use} option in favor of explicit package prefixes, this
  problem is avoided entirely.
\end{itemize}

While  it may seem that the use of explicit package prefixes might be
undesirable when the name of the package contains many characters,
this problem can now be entirely avoided by the use of
\emph{package-local nicknames}, or PLNs.  While package-local
nicknames are not part of the \commonlisp{} standard, they are now
supported by all major \commonlisp{} implementations, including the
two main commercial ones.

Package-local nicknames also give the author of a module greater
freedom to name the package so that it is more likely to be globally
unique.  There is no longer a reason to name a package something
common like \texttt{core}, even though it is meant to be used
exclusively in a particular implementation.

\section{Names of slot accessors}

There is an older convention that dictates that the name of a slot
accessor for some class should be prefixed by the name of the class.
We see this convention in the \commonlisp{} standard itself, as in
\texttt{symbol-name}, \texttt{package-use-list}, and
\texttt{array-total-size}.  We also see it in the CLIM
specification as in \texttt{port-name}, \texttt{sheet-parent}, etc.
It is even recommended in the literature, for example in Sonja Keene's
book on object-oriented programming in \commonlisp{}
\cite{sonya-keene}.

We do not know the origins of this convention, but all three examples
cited above seem to suggest underuse of the package system for reasons
that are unknown to us.  While the older convention is acceptable for
shallow class hierarchies, it is more dubious for deeper ones.  For
instance, since there is no subclass of \texttt{package} in
\commonlisp{}, then it follows that \texttt{package-use-list} must be
given a package as an argument.  However, in CLIM, the class
\texttt{sheet} is the base class of a fairly deep hierarchy, and
a call such as \texttt{(sheet-parent stream)} looks very confusing
indeed.

Thus, instead of this older convention, we suggest a more extensive
use of the package system, as suggested in
\refSec{sec-coding-style-using-packages}, and to omit the prefix of
the symbol itself.  Using the package system instead, has the
additional advantage that package-local nicknames can be used to
customize the prefix, again as suggested in
\refSec{sec-coding-style-using-packages}.

\section{Use of \texttt{defgeneric}}

The \commonlisp{} standard requires the occurrence of a
\texttt{defmethod} form without any corresponding preceding
\texttt{defgeneric} form to result in the creation of a generic
function, using some information derived from the \texttt{defmethod}
form.  Even early literature about the \commonlisp{} object system
\cite{Paeke:1993:OOP} suggests that \texttt{defmethod} was meant to be
the main operator for creating generic functions, and that
\texttt{defgeneric} should be used only when the default parameters
would not be adequate, such as when a different method combination was
desired.  The idea here is, of course, to make it more
\emph{convenient} for the programmer to create a generic function.

This idea of convenience has a long history in computing.  Already,
the Fortran programming language did not require explicit definition
of variables.  The first letter of the name of the variable would
determine the type, and the variable was implicitly created the first
time it was seen by the compiler.  And early Lisp variants would
silently create a special variable when \texttt{setq} was used with a
name not previously seen.

But this ``convenience'' has a high price.  A typo in a Fortran
variable name will silently create a different variable.  The same is
true for a typo used in \texttt{setq} in early Lisp.  For this reason,
over time, we tend to move away from such ``convenience'' and towards
more feedback to the programmer when it can be suspected that the
programmer made a mistake, thereby often saving many hours of
debugging time due to a silly mistake.  Hence, \commonlisp{} now
recognizes that a top-level \texttt{setq} using a name that was not
previously defined using some defining form such as \texttt{defvar} or
\texttt{defparameter} is undefined behavior.  And most modern
\commonlisp{} implementations will signal a warning for such a
top-level \texttt{setq}.

Similarly, a typo in the name given to \texttt{defmethod} will
(silently in many \commonlisp{} implementations) create a new generic
function.  More likely than a typo is the omission of a package
prefix.  For this reason, we strongly encourage maintainers of
\commonlisp{} implementations to signal a style warning for this
situations.  And we strongly encourage the systematic use of
\texttt{defgeneric}, even for slot accessors.  Using an explicit
\texttt{defgeneric} form has the additional advantage that appropriate
names can be given for lambda-list parameters.

%%  LocalWords:  accessors accessor underuse
